Background: AllLife Bank wants to focus on its credit card customer base in the next financial year. The marketing research team has advised that market penetration can be improved. Based on this input, the Marketing team proposes to run personalised campaigns to target new customers as well as upsell to existing ones. Another insight from the market research is that customers perceive the bank's support services poorly. Based on this, the Operations team wants to upgrade the service delivery model to ensure that customers' queries are resolved faster. The Head of Marketing and the Head of Delivery have both decided to reach out to the Data Science team for help.
To identify different segments in the existing customer base, based on their spending patterns as well as past interactions with the bank.
How many different segments of customers are there?
How are these segments different from each other?
What are your recommendations to the bank on how to better market to and service these customers?
Customer key - Identifier for the customer
Average Credit Limit - Average credit limit across all the credit cards
Total credit cards - Total number of credit cards
Total visits bank - Total number of bank visits
Total visits online - total number of online visits
Total calls made - Total number of calls made by the customer
Perform univariate analysis on the data to better understand the variables at your disposal and to get an idea about the number of clusters. Perform EDA and create visualizations to explore the data. (10 marks)
Properly comment the code, provide explanations of the steps taken in the notebook, and conclude your insights from the graphs. (5 marks)
Execute K-means clustering, use an elbow plot, and analyse the clusters using boxplots. (10 marks)
Execute hierarchical clustering (with different linkages) with the help of dendrograms and the cophenetic coefficient. Analyse the clusters formed using boxplots. (15 marks)
Calculate the average silhouette score for both methods. (5 marks)
Compare the K-means clusters with the hierarchical clusters. (5 marks)
Analyse the clusters formed, explain how one cluster differs from another, and answer all the key questions. (10 marks)
# To enable plotting graphs in Jupyter notebook
%matplotlib inline
# Numerical and clustering libraries
import numpy as np
from sklearn.cluster import KMeans
# to handle data in form of rows and columns
import pandas as pd
# importing ploting libraries
import matplotlib.pyplot as plt
#importing seaborn for statistical plots
import seaborn as sns
from sklearn import metrics
# reading the Excel file into a pandas dataframe
mydata = pd.read_excel("Credit Card Customer Data.xlsx")
mydata.head()
#Info of the dataset
mydata.info()
mydata.describe().transpose()
mydata.isnull().sum()
# Data type of the columns
mydata.dtypes
# Creating a profile report for analysis
# !pip install pandas_profiling   (the package has since been renamed to ydata-profiling)
import pandas_profiling
mydata.profile_report()
mydata = mydata.drop_duplicates(subset=None, keep="first", inplace=False)
mydata.info()
mydata_count = mydata.groupby(['Customer Key'], sort=False).size().reset_index(name='Count')
mydata_count
duplicateRows = mydata_count[mydata_count['Count'] == 2]
print(duplicateRows)
for index, row in duplicateRows.iterrows():
    rec = mydata[mydata['Customer Key'] == row['Customer Key']]
    for _, dup_row in rec.iterrows():
        delrecord = dup_row
        print(delrecord)
# Drop the hand-picked duplicate records (combined masks avoid chained boolean indexing)
rec1 = mydata[(mydata['Customer Key'] == 47437) & (mydata['Avg_Credit_Limit'] == 17000)].index
mydata = mydata.drop(rec1)
mydata = mydata.drop(mydata[(mydata['Customer Key'] == 37252) & (mydata['Avg_Credit_Limit'] == 6000)].index)
mydata = mydata.drop(mydata[(mydata['Customer Key'] == 97935) & (mydata['Avg_Credit_Limit'] == 17000)].index)
mydata = mydata.drop(mydata[(mydata['Customer Key'] == 96929) & (mydata['Avg_Credit_Limit'] == 13000)].index)
mydata = mydata.drop(mydata[(mydata['Customer Key'] == 50706) & (mydata['Avg_Credit_Limit'] == 44000)].index)
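The hand-picked drops above can also be done in one vectorized step if the intent is simply to keep one record per customer. A sketch on a hypothetical frame (note this keeps the first occurrence per `Customer Key`, which may differ from the specific rows dropped above):

```python
import pandas as pd

# Hypothetical frame with one duplicated Customer Key
df = pd.DataFrame({
    "Customer Key": [47437, 47437, 37252],
    "Avg_Credit_Limit": [100000, 17000, 6000],
})

# Keep only the first record seen for each Customer Key
deduped = df.drop_duplicates(subset="Customer Key", keep="first")
```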
mydata.info()
mydata.drop('Sl_No', axis=1, inplace=True)
mydata.drop('Customer Key', axis=1, inplace=True)
import itertools
cols = list(mydata.columns)  # all feature columns
fig = plt.figure(figsize=(15, 20))
for i, j in itertools.zip_longest(cols, range(len(cols))):
    plt.subplot(4, 2, j + 1)
    ax = sns.distplot(mydata[i], color='green', rug=True)
    plt.axvline(mydata[i].mean(), linestyle="dashed", label="mean", color='black')
    plt.legend()
    plt.title(i)
    plt.xlabel("")
plt.figure(figsize= (20,15))
plt.subplot(3,3,1)
sns.boxplot(x= mydata.Avg_Credit_Limit, color='green')
plt.subplot(3,3,2)
sns.boxplot(x= mydata.Total_Credit_Cards, color='green')
plt.subplot(3,3,3)
sns.boxplot(x= mydata.Total_visits_bank, color='green')
plt.show()
plt.figure(figsize= (20,15))
plt.subplot(4,4,1)
sns.boxplot(x= mydata.Total_visits_online, color='green')
plt.subplot(4,4,2)
sns.boxplot(x= mydata.Total_calls_made, color='green')
plt.show()
# Skewness of each feature
mydata.skew(axis = 0, skipna = True)
mydata.info()
saveDF = mydata.copy()
corr = mydata.corr()
sns.heatmap(corr, annot = True)
sns.pairplot(mydata, diag_kind='kde')
mydata['Avg_Credit_Limit'].value_counts()
mydata['Total_Credit_Cards'].value_counts()
mydata['Total_visits_bank'].value_counts()
mydata['Total_visits_online'].value_counts()
mydata['Total_calls_made'].value_counts()
sns.boxplot(x='Total_Credit_Cards',y='Total_calls_made',data=mydata)
sns.boxplot(x='Total_Credit_Cards',y='Total_visits_online',data=mydata)
sns.boxplot(x='Total_Credit_Cards',y='Total_visits_bank',data=mydata)
sns.boxplot(x='Total_Credit_Cards',y='Avg_Credit_Limit',data=mydata)
mydata.corr()
pd.crosstab(mydata['Total_visits_bank'],mydata['Total_calls_made'], normalize='columns')
pd.crosstab(mydata['Total_Credit_Cards'],mydata['Total_calls_made'], normalize='columns')
pd.crosstab(mydata['Total_Credit_Cards'],mydata['Total_visits_bank'], normalize='columns')
##Scale the data
from scipy.stats import zscore
mydata_scaled = mydata.apply(zscore)
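As a quick sanity check, z-scored columns should end up with mean ≈ 0 and population standard deviation ≈ 1. A minimal sketch on toy numbers (not the bank data); note that scipy's `zscore` uses `ddof=0` by default, unlike pandas' `.std()`:

```python
import pandas as pd
from scipy.stats import zscore

# Toy column standing in for a scaled feature
toy = pd.DataFrame({"limit": [10000.0, 20000.0, 30000.0, 40000.0]})
toy_scaled = toy.apply(zscore)  # (x - mean) / population std
```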
# K-means for k = 2..14: capture inertia and average silhouette score
cluster_sil_scores = []
cluster_errors = []
cluster_range = range( 2, 15)
for num_clusters in cluster_range:
    clusters = KMeans(num_clusters, n_init=10, random_state=5)
    clusters.fit(mydata_scaled)
    labels = clusters.labels_              # capture the cluster labels
    centroids = clusters.cluster_centers_  # capture the centroids
    cluster_errors.append(clusters.inertia_)  # capture the inertia
    cluster_sil_scores.append(metrics.silhouette_score(mydata_scaled, labels, metric='euclidean'))
# combine the cluster range, errors and silhouette scores into a dataframe
clusters_df = pd.DataFrame({"num_clusters": cluster_range, "cluster_errors": cluster_errors, "Avg Sil Score": cluster_sil_scores})
clusters_df[0:15]
# Elbow plot
plt.figure(figsize=(12,6))
plt.plot( clusters_df.num_clusters, clusters_df.cluster_errors, marker = "o")
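The elbow is read off the plot by eye, but it can also be picked programmatically. A sketch of one common heuristic — the k with the largest second difference of the inertia curve — run on synthetic blob data (the blob centers, sample size, and k range are illustrative assumptions, not from the notebook):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in: three well-separated blobs
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [5, 8]],
                  cluster_std=0.5, random_state=5)

ks = list(range(1, 8))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=5).fit(X).inertia_
            for k in ks]

# Second difference of inertia: largest value marks where the drop slows sharply
second_diff = np.diff(inertias, 2)
elbow_k = ks[int(np.argmax(second_diff)) + 1]  # +1 aligns diff index to k
```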
# Finding the optimal number of clusters via average distortion
from scipy.spatial.distance import cdist
clusters = range(1, 22)
meanDistortions = []
for k in clusters:
    model = KMeans(n_clusters=k)
    model.fit(mydata_scaled)
    prediction = model.predict(mydata_scaled)
    meanDistortions.append(sum(np.min(cdist(mydata_scaled, model.cluster_centers_, 'euclidean'), axis=1)) / mydata_scaled.shape[0])
plt.plot(clusters, meanDistortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Average distortion')
plt.title('Selecting k with the Elbow Method')
# Let us first start with K = 3
final_model=KMeans(3)
final_model.fit(mydata_scaled)
prediction=final_model.predict(mydata_scaled)
#Append the prediction
mydata_scaled["GROUP"] = prediction
mydata["GROUP"] = prediction
print("Groups Assigned : \n")
mydata.head()
#Visualising the clusters
x = mydata_scaled.values
y_kmeans=prediction
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s = 100, c = 'green', label = 'Group 1')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s = 100, c = 'skyblue', label = 'Group 2')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s = 100, c = 'orange', label = 'Group 3')
#Plotting the centroids of the clusters
plt.scatter(final_model.cluster_centers_[:, 0], final_model.cluster_centers_[:,1], s = 100, c = 'yellow', label = 'Centroids')
plt.legend()
# Box plot to visualize GROUP vs Avg_Credit_Limit
sns.boxplot(x='GROUP', y='Total_Credit_Cards', data=mydata)
sns.boxplot(x='GROUP', y='Avg_Credit_Limit', data=mydata)
sns.boxplot(x='GROUP', y='Total_visits_online', data=mydata)
sns.boxplot(x='GROUP', y='Total_visits_bank', data=mydata)
sns.boxplot(x='GROUP', y='Total_calls_made', data=mydata)
mydata_group = mydata.groupby(['GROUP'])
mydata_group.mean()
mydata_scaled.boxplot(by='GROUP', layout = (2,4),figsize=(15,10))
# Refit with K = 5 on the feature columns only (the GROUP label from the K=3 run is excluded)
features = mydata_scaled.drop('GROUP', axis=1)
final_model = KMeans(5)
final_model.fit(features)
prediction = final_model.predict(features)
#Append the prediction
mydata_scaled["GROUP"] = prediction
mydata["GROUP"] = prediction
print("Groups Assigned : \n")
mydata.head()
mydata_group = mydata.groupby(['GROUP'])
mydata_group.mean()
mydata_scaled.boxplot(by='GROUP', layout=(2, 4), figsize=(15, 10))
mydata_scaled.info()
print(final_model.cluster_centers_)
# Cluster centres (in z-score units) for each feature
cols = ["Avg_Credit_Limit", "Total_Credit_Cards", "Total_visits_bank", "Total_visits_online", "Total_calls_made"]
km_centers = pd.DataFrame(final_model.cluster_centers_, columns=cols)
km_centers.plot.bar(ylim=[0,2],fontsize=10)
km_centers
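The bar chart above shows centroids in z-score units, which are hard to read as credit limits or visit counts. A sketch of converting centroids back to original units, on hypothetical toy data rather than the bank dataset (the unscaling formula `x = z * std + mean` uses the population std, matching `zscore`'s `ddof=0`):

```python
import pandas as pd
from scipy.stats import zscore
from sklearn.cluster import KMeans

# Toy stand-in for two of the features (hypothetical values)
raw = pd.DataFrame({
    "Avg_Credit_Limit": [10000.0, 12000.0, 90000.0, 95000.0],
    "Total_Credit_Cards": [2.0, 3.0, 8.0, 9.0],
})
scaled = raw.apply(zscore)

km = KMeans(n_clusters=2, n_init=10, random_state=5).fit(scaled)

# Undo the z-scoring column by column: x = z * std + mean
centers_original = km.cluster_centers_ * raw.std(ddof=0).values + raw.mean().values
centers_df = pd.DataFrame(centers_original, columns=raw.columns)
```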
# Imports
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
%config InlineBackend.figure_format='retina'
# Load the data and drop identifier columns before scaling
df = pd.read_excel("Credit Card Customer Data.xlsx")
df = df.drop(['Sl_No', 'Customer Key'], axis=1)
# Standardize the data to have a mean of ~0 and a variance of 1
X_std = StandardScaler().fit_transform(df)
# Create a PCA instance: pca
pca = PCA(n_components=3)
principalComponents = pca.fit_transform(X_std)
# Plot the explained variances
features = range(pca.n_components_)
plt.bar(features, pca.explained_variance_ratio_, color='black')
plt.xlabel('PCA features')
plt.ylabel('variance %')
plt.xticks(features)
# Save components to a DataFrame
PCA_components = pd.DataFrame(principalComponents)
plt.scatter(PCA_components[0], PCA_components[1], alpha=.1, color='black')
plt.xlabel('PCA 1')
plt.ylabel('PCA 2')
ks = range(1, 10)
inertias = []
for k in ks:
    # Create a KMeans instance with k clusters
    model = KMeans(n_clusters=k)
    # Fit model to the first three principal components
    model.fit(PCA_components.iloc[:, :3])
    # Append the inertia to the list of inertias
    inertias.append(model.inertia_)
plt.plot(ks, inertias, '-o', color='black')
plt.xlabel('number of clusters, k')
plt.ylabel('inertia')
plt.xticks(ks)
plt.show()
model1 = KMeans(n_clusters=2)
# Fit model to samples
model1.fit(PCA_components.iloc[:,:3])
prediction1=model1.predict(PCA_components)
PCA_components["GROUPS"] = prediction1
PCA_components.head()
PCA_components.groupby(['GROUPS']).mean()
from sklearn.cluster import AgglomerativeClustering
# Restore a clean copy of the data saved earlier and re-scale it
mydata_h = saveDF.copy()
mydata_h_scaled = mydata_h.apply(zscore)
mydata_h.info()
# Note: affinity= was renamed to metric= in newer scikit-learn versions
model = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='average')
model.fit(mydata_h_scaled)
mydata_h['labels'] = model.labels_
mydata_h_scaled["labels"] = model.labels_
mydata_h.head(100)
mydata_h_scaled.info()
custDataClust = mydata_h.groupby(['labels'])
from scipy.cluster.hierarchy import cophenet, dendrogram, linkage
from scipy.spatial.distance import pdist #Pairwise distribution between data points
# The cophenetic coefficient measures the correlation between pairwise distances in feature space
# and distances on the dendrogram; the closer it is to 1, the better the clustering.
# Compute the linkage on the feature columns only (exclude the labels column)
Z = linkage(mydata_h_scaled.drop('labels', axis=1), metric='euclidean', method='average')
c, coph_dists = cophenet(Z, pdist(mydata_h_scaled.drop('labels', axis=1)))
c
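Rather than repeating the cophenet computation once per linkage, the coefficients can be collected in one loop. A sketch on synthetic blob data (the blob centers and sample size are illustrative assumptions, not the bank dataset):

```python
from scipy.cluster.hierarchy import cophenet, linkage
from scipy.spatial.distance import pdist
from sklearn.datasets import make_blobs

# Synthetic stand-in: three well-separated blobs
X, _ = make_blobs(n_samples=100, centers=[[0, 0], [12, 0], [0, 12]],
                  cluster_std=1.0, random_state=5)
dists = pdist(X)

coph = {}
for method in ["average", "ward", "single", "complete"]:
    Z = linkage(X, metric="euclidean", method=method)
    c, _ = cophenet(Z, dists)  # correlation of dendrogram vs raw distances
    coph[method] = c
# Higher cophenetic correlation = dendrogram distances track the raw distances better
```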
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendrogram with Linkage-Average')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z,leaf_rotation=90.0,p=5,leaf_font_size=10,truncate_mode='level')
plt.tight_layout()
custDataClust.mean()
# Restore a clean copy of the saved data for ward linkage
mydata_w = saveDF.copy()
mydata_w_scaled = mydata_w.apply(zscore)
mydata_w.info()
model_ward = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
model_ward.fit(mydata_w_scaled)
mydata_w['labels_ward'] = model_ward.labels_
mydata_w_scaled["labels_ward"] = model_ward.labels_
mydata_w.head(100)
mydata_w.groupby(['labels_ward']).mean()
Z = linkage(mydata_w_scaled.drop('labels_ward', axis=1), metric='euclidean', method='ward')
c, coph_dists = cophenet(Z, pdist(mydata_w_scaled.drop('labels_ward', axis=1)))
c
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendrogram with Linkage-Ward')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z,leaf_rotation=90.0,p=5,leaf_font_size=10,truncate_mode='level')
plt.tight_layout()
# Restore a clean copy of the saved data for single linkage
mydata_single = saveDF.copy()
mydata_single_scaled = mydata_single.apply(zscore)
mydata_single.info()
model_single = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='single')
model_single.fit(mydata_single_scaled)
mydata_single['labels_single'] = model_single.labels_
mydata_single_scaled["labels_single"] = model_single.labels_
mydata_single.head(100)
mydata_single.groupby(['labels_single']).mean()
Z = linkage(mydata_single_scaled.drop('labels_single', axis=1), metric='euclidean', method='single')
c, coph_dists = cophenet(Z, pdist(mydata_single_scaled.drop('labels_single', axis=1)))
c
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendrogram with Linkage-Single')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z,leaf_rotation=90.0,p=5,leaf_font_size=10,truncate_mode='level')
plt.tight_layout()
# Restore a clean copy of the saved data for complete linkage
mydata_complete = saveDF.copy()
mydata_complete_scaled = mydata_complete.apply(zscore)
mydata_complete.info()
model_complete = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='complete')
model_complete.fit(mydata_complete_scaled)
mydata_complete['labels_Complete'] = model_complete.labels_
mydata_complete_scaled["labels_Complete"] = model_complete.labels_
mydata_complete.head(100)
mydata_complete.groupby(['labels_Complete']).mean()
# cophenet index is a measure of the correlation between the distance of points in feature space and distance on dendrogram
# closer it is to 1, the better is the clustering
Z = linkage(mydata_complete_scaled.drop('labels_Complete', axis=1), metric='euclidean', method='complete')
c, coph_dists = cophenet(Z, pdist(mydata_complete_scaled.drop('labels_Complete', axis=1)))
c
plt.figure(figsize=(10, 5))
plt.title('Agglomerative Hierarchical Clustering Dendrogram-Complete')
plt.xlabel('sample index')
plt.ylabel('Distance')
dendrogram(Z,leaf_rotation=90.0,p=5,leaf_font_size=10,truncate_mode='level')
plt.tight_layout()
mydata_h_scaled.boxplot(by='labels', layout = (2,4),figsize=(15,10))
# Visualizing the clustering
plt.scatter(mydata_complete_scaled['Avg_Credit_Limit'], mydata_complete_scaled['Total_visits_online'],
c = AgglomerativeClustering(n_clusters = 3).fit_predict(mydata_complete_scaled), cmap =plt.cm.winter)
plt.show()
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import StandardScaler, normalize
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import scipy.cluster.hierarchy as shc
from sklearn.metrics import silhouette_samples
# Restore a clean copy of the saved data for the silhouette analysis
mydata = saveDF.copy()
mydata_scaled = mydata.apply(zscore)
mydata_scaled.info()
colorlist =["tomato","antiquewhite","blueviolet","cornflowerblue","darkgreen","seashell","skyblue","mediumseagreen"]
n_clusters=3
km = KMeans(n_clusters=3, init="k-means++", n_init=10, max_iter=300)
km.fit(mydata_scaled)
cluster_labels = km.predict(mydata_scaled)
# Calculate the average silhouette score
silhouette_avg = silhouette_score(mydata_scaled, cluster_labels)
# Calculate the silhouette score for each sample
each_silhouette_score = silhouette_samples(mydata_scaled, cluster_labels, metric="euclidean")
#Visualization
fig =plt.figure()
ax = fig.add_subplot(1,1,1)
y_lower = 10
for i in range(n_clusters):
    ith_cluster_silhouette_values = each_silhouette_score[cluster_labels == i]
    ith_cluster_silhouette_values.sort()
    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i
    color = colorlist[i]
    ax.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values, facecolor=color, edgecolor=color, alpha=0.3)
    # label the silhouette plots with their cluster numbers at the middle
    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    # compute the new y_lower for the next plot
    y_lower = y_upper + 10
ax.set_title("Silhouette plot - KMeans")
ax.set_xlabel("silhouette score")
ax.set_ylabel("Cluster label")
#the vertical line for average silhouette score of all the values
ax.axvline(x=silhouette_avg,color="red",linestyle="--")
ax.set_yticks([])
ax.set_xticks([-0.2,0,0.2,0.4,0.6,0.8,1])
# Restore a clean copy of the saved data for the hierarchical silhouette plot
mydata = saveDF.copy()
mydata_scaled = mydata.apply(zscore)
mydata_scaled.info()
colorlist =["tomato","antiquewhite","blueviolet","cornflowerblue","darkgreen","seashell","skyblue","mediumseagreen"]
n_clusters=2
km = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='average')
km.fit(mydata_scaled)
cluster_labels = km.labels_
# Calculate the average silhouette score
silhouette_avg = silhouette_score(mydata_scaled, cluster_labels)
# Calculate the silhouette score for each sample
each_silhouette_score = silhouette_samples(mydata_scaled, cluster_labels, metric="euclidean")
#Visualization
fig =plt.figure()
ax = fig.add_subplot(1,1,1)
y_lower = 10
for i in range(n_clusters):
    ith_cluster_silhouette_values = each_silhouette_score[cluster_labels == i]
    ith_cluster_silhouette_values.sort()
    size_cluster_i = ith_cluster_silhouette_values.shape[0]
    y_upper = y_lower + size_cluster_i
    color = colorlist[i]
    ax.fill_betweenx(np.arange(y_lower, y_upper), 0, ith_cluster_silhouette_values, facecolor=color, edgecolor=color, alpha=0.3)
    # label the silhouette plots with their cluster numbers at the middle
    ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
    # compute the new y_lower for the next plot
    y_lower = y_upper + 10
ax.set_title("Silhouette plot - Hierarchical clustering")
ax.set_xlabel("silhouette score")
ax.set_ylabel("Cluster label")
#the vertical line for average silhouette score of all the values
ax.axvline(x=silhouette_avg,color="red",linestyle="--")
ax.set_yticks([])
ax.set_xticks([-0.2,0,0.2,0.4,0.6,0.8,1])
# K-means silhouette scores across cluster counts (for comparison with hierarchical below)
cluster_sil_scores = []
cluster_errors = []
cluster_range = range(2, 15)
for num_clusters in cluster_range:
    clusters = KMeans(num_clusters, n_init=10, random_state=5)
    clusters.fit(mydata_scaled)
    labels = clusters.labels_              # capture the cluster labels
    centroids = clusters.cluster_centers_  # capture the centroids
    cluster_errors.append(clusters.inertia_)  # capture the inertia
    cluster_sil_scores.append(metrics.silhouette_score(mydata_scaled, labels, metric='euclidean'))
# combine the cluster range, errors and silhouette scores into a dataframe
clusters_df = pd.DataFrame({"num_clusters": cluster_range, "cluster_errors": cluster_errors, "Avg Sil Score": cluster_sil_scores})
clusters_df[0:15]
# Hierarchical (average linkage) silhouette scores across cluster counts
cluster_sil_scores = []
cluster_range = range(2, 15)
for num_clusters in cluster_range:
    clusters = AgglomerativeClustering(n_clusters=num_clusters, affinity='euclidean', linkage='average')
    clusters.fit(mydata_scaled)
    labels = clusters.labels_  # capture the cluster labels
    cluster_sil_scores.append(metrics.silhouette_score(mydata_scaled, labels, metric='euclidean'))
# combine the cluster range and silhouette scores into a dataframe
clusters_df = pd.DataFrame({"num_clusters": cluster_range, "Avg Sil Score": cluster_sil_scores})
clusters_df[0:15]
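For the comparison step, the two methods' average silhouette scores can also be placed side by side. A minimal sketch on synthetic blobs with an assumed k=3 for both methods (the notebook's actual scores come from the loops above; the data here is illustrative):

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in: three well-separated blobs
X, _ = make_blobs(n_samples=200, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=1.0, random_state=5)

km_labels = KMeans(n_clusters=3, n_init=10, random_state=5).fit_predict(X)
hc_labels = AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X)

# One row per method, with its average silhouette score
comparison = pd.DataFrame({
    "method": ["KMeans", "Agglomerative (average)"],
    "avg_silhouette": [silhouette_score(X, km_labels),
                       silhouette_score(X, hc_labels)],
})
```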
# Standardize data
scaler = StandardScaler()
scaled_df = scaler.fit_transform(mydata)
# Normalizing the Data
normalized_df = normalize(scaled_df)
# Converting the numpy array into a pandas DataFrame
normalized_df = pd.DataFrame(normalized_df)
# Reducing the dimensions of the data
pca = PCA(n_components = 2)
X_principal = pca.fit_transform(normalized_df)
X_principal = pd.DataFrame(X_principal)
X_principal.columns = ['P1', 'P2']
X_principal.head(2)
plt.figure(figsize =(6, 6))
plt.title('Visualising the data')
Dendrogram = shc.dendrogram((shc.linkage(X_principal, method ='ward')))
K-Means produced three clusters; hierarchical clustering produced two.
Cluster 0 contains the most data points in K-Means, and likewise Cluster 0 in the hierarchical method.
Cluster 1:
1. Lower credit limit
2. Fewer credit cards
3. More calls made and online visits compared to bank visits
Cluster 2:
1. Higher credit limit
2. More credit cards
3. More online visits compared to bank visits and calls
Cluster 3:
1. Average credit limit
2. Average number of credit cards
3. More bank visits and calls compared to online visits
Based on the data and the analysis, the customers fall into two broad segments:
1. Low/average credit limit customers (likely new customers)
2. High credit limit customers (existing, trusted customers)
These segments differ in how customers interact with the bank to resolve their queries: some prefer online channels to calls or branch visits, and vice versa.
For new and average-credit-limit customers, the focus should be on good in-branch support, branch campaigns, and call support to resolve queries.
For existing high-limit customers, strong online support (chat, etc.) will provide more customer satisfaction.
Overall, the bank should invest more in online support to retain existing customers rather than in additional branch infrastructure, and run more campaigns at branch locations to attract new customers, supporting them through calls and in-person service rather than online channels.